[sync] upstream llm-d main branch 8be94cd [2026-04-13]#64

Open
zdtsw wants to merge 10 commits into opendatahub-io:main from zdtsw-forking:sync/upstream-8be94cd

Conversation

Member

@zdtsw zdtsw commented Apr 13, 2026

Syncs llm-d/llm-d-workload-variant-autoscaler main branch into ODH main branch.

Upstream commit: llm-d@8be94cd

Generated by #41

Upstream commits included:

Summary by CodeRabbit

  • New Features

    • Added OpenShift platform support to benchmarking workflows with manual workflow dispatch capability
    • Added prefill-heavy workload benchmark test with metrics collection and analysis
    • Added HPA and KEDA integration sample configurations
  • Documentation

    • Reorganized documentation with updated user-guide structure
    • Removed deprecated tutorial content; added developer benchmarking guides
    • Updated configuration and scale-from-zero prerequisites
  • Improvements

    • Enhanced benchmark infrastructure with EPP configuration management and Gateway connectivity verification
    • Improved error handling and test utilities
    • Updated deployment configuration options for KV cache and queue thresholds

mamy-CS and others added 10 commits April 3, 2026 10:52
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
The llm-d/llm-d repo already runs `Nightly - WVA E2E (CKS)` using the
same reusable workflow, same cluster, and same namespace. This copy has
never passed (30/30 failures) and wastes runner time.

Signed-off-by: Andrew Anderson <andy@clubanderson.com>
Signed-off-by: Mohammed Abdi <mohammed.munir.abdi@ibm.com>
…e metrics server in tests, since it's not needed for indexer testing. (llm-d#976)

Signed-off-by: greg pereira <grpereir@redhat.com>
Signed-off-by: Braulio Dumba <Braulio.Dumba@ibm.com>
- docs/user-guide/LeaderWorkerSet-support.md: shoud -> should
- internal/engines/analyzers/queueingmodel/tuner/utils.go: expextedObservations -> expectedObservations (4 occurrences)

Note: "PNGs" in test/benchmark/config.go is correct (plural of PNG), not a typo.

Closes llm-d#972
Resolve all linter violations reported by golangci-lint-v2, including:

- perfsprint: replace fmt.Errorf with errors.New for static strings,
  fmt.Sprintf("%d",...) with strconv.Itoa, and fmt.Sprintf with string
  concatenation where applicable
- goconst: extract repeated string literals into package-level constants
- gocritic: rewrite if-else chains as switch statements, fix captLocal,
  assignOp, appendAssign, singleCaseSwitch, elseif, deprecatedComment
- ginkgolinter: replace len(x) assertions with HaveLen/BeEmpty
- prealloc: pre-allocate slices where capacity is known
- gofmt: auto-fix formatting
- loggercheck: fix odd key-value pairs in structured logging calls
- revive: fix comment spacing (//TODO -> // TODO, /** */ -> //)
- fatcontext: add nolint directives for test suite setup patterns
- unparam: handle unused parameters
- copyloopvar: remove unnecessary tc := tc copies (Go 1.22+)
- misspell: fix throughtput -> throughput
- unconvert: remove unnecessary type conversions
- staticcheck: add .golangci.yml exclusion for deprecated Accelerator field

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Lionel Villard <villard@us.ibm.com>
# Conflicts:
#	docs/user-guide/scale-from-zero.md

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough

This pull request introduces OpenShift platform support for the CI benchmark workflow with manual trigger capability, adds a new .golangci.yml linter configuration, and implements extensive code quality improvements across the codebase. Major additions include a new prefill-heavy benchmark test suite (~900 lines), updated deployment scripts for nightly installations supporting both CKS and OpenShift, and new test utilities for Kubernetes job orchestration and EPP configuration. The PR also refactors error handling from formatted errors to sentinel-style errors, replaces string formatting calls with direct concatenation/strconv conversions, and reorganizes documentation links and structure. Multiple configuration files, test fixtures, and API types receive updates for consistency and efficiency (slice pre-allocation, constant definitions for repeated strings).

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes


Issues

CI Pipeline Security: The .github/workflows/ci-benchmark.yaml workflow_dispatch input accepts model_id without validation constraints. Verify that arbitrary model IDs cannot be supplied to exfiltrate private models or bypass restrictions intended for the issue-comment trigger path.
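
An input guard of the kind this comment asks for can be sketched in Go. The actual workflow input lives in YAML, so the pattern, character set, and function name below are illustrative assumptions, not taken from the repository:

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical allowlist for Hugging Face-style model IDs ("org/name").
// The exact character set a real workflow should accept is an assumption.
var modelIDPattern = regexp.MustCompile(`^[A-Za-z0-9._-]+/[A-Za-z0-9._-]+$`)

// validModelID rejects anything outside the allowlist, including
// shell metacharacters that could be smuggled into a dispatch input.
func validModelID(id string) bool {
	return modelIDPattern.MatchString(id)
}

func main() {
	fmt.Println(validModelID("meta-llama/Llama-3.1-8B")) // true
	fmt.Println(validModelID("x; curl evil.example"))    // false
}
```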

Error Context Loss: Widespread conversion from fmt.Errorf to errors.New (e.g., internal/config/loader.go, internal/config/scale_to_zero.go, pkg/solver/optimizer.go) removes error context/wrapping. This pattern reduces debuggability when errors propagate through multiple layers. Sentinel-style errors are appropriate for specific, well-known failure modes; consider whether each instance truly requires loss of context. Document the rationale if intentional.

Uninformative Timeout in test/benchmark/epp_config.go:PatchEPPConfigMap: The function polls the EPP deployment status with a 5-minute timeout via wait.PollUntilContextTimeout. If the deployment rollout never completes (stuck/crash-loop), the benchmark hangs until the timeout expires without clear failure messaging. Return an explicit error that includes the last-observed deployment status.

Missing Metrics Validation in test/benchmark/prefill_heavy_benchmark_test.go:

  • The PrefillResult struct is populated with zero values when GuideLLM metrics extraction fails (missing === BENCHMARK JSON === marker). There is no validation that the TTFT, ITL, and Throughput fields were actually parsed from valid JSON; callers cannot distinguish "no metrics collected" from "collection failed."
  • JSON written to /tmp/prefill-benchmark-results.json is not validated for schema compliance before persistence.
  • Consider returning an error or explicit MetricsValid boolean flag in the result struct.
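
The MetricsValid flag suggested above can be sketched as follows. The struct shape, field names, and marker handling are assumptions mirroring the review's description, not the actual test code:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Hypothetical result shape: an explicit MetricsValid flag lets
// callers tell zero-valued metrics apart from a failed extraction.
type PrefillResult struct {
	TTFT         float64 `json:"ttft_ms"`
	Throughput   float64 `json:"throughput_rps"`
	MetricsValid bool    `json:"metrics_valid"`
}

const marker = "=== BENCHMARK JSON ==="

// parseResult extracts the JSON payload after the marker; on any
// failure it returns a zero result with MetricsValid=false.
func parseResult(logs string) PrefillResult {
	i := strings.Index(logs, marker)
	if i < 0 {
		return PrefillResult{} // marker missing: collection failed
	}
	var r PrefillResult
	if err := json.Unmarshal([]byte(strings.TrimSpace(logs[i+len(marker):])), &r); err != nil {
		return PrefillResult{} // malformed payload
	}
	r.MetricsValid = true
	return r
}

func main() {
	ok := parseResult(marker + "\n" + `{"ttft_ms": 12.5, "throughput_rps": 40}`)
	fmt.Println(ok.MetricsValid, ok.TTFT) // true 12.5
	bad := parseResult("no marker in these logs")
	fmt.Println(bad.MetricsValid) // false
}
```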

Gateway Connectivity Verification Fragility in test/benchmark/job_waiter.go:VerifyGatewayConnectivity:

  • The function checks that pod logs contain "OK" string, but does not validate the HTTP response structure or status code.
  • If the Gateway returns an error response that happens to include "OK" in the error text, the check will incorrectly pass (CWE-345: Insufficient Verification of Data Authenticity).
  • Parse and validate the actual JSON response from the Gateway.
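
Parsing instead of substring-matching can be sketched like this. The response schema (a `status` field) is a hypothetical example; the real Gateway payload may differ:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical response shape; substitute the Gateway's real schema.
type healthResponse struct {
	Status string `json:"status"`
}

// gatewayHealthy decodes the body and checks the parsed field, so an
// error message that merely contains "OK" cannot pass the check.
func gatewayHealthy(body []byte) bool {
	var r healthResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return false
	}
	return r.Status == "OK"
}

func main() {
	fmt.Println(gatewayHealthy([]byte(`{"status":"OK"}`)))                 // true
	fmt.Println(gatewayHealthy([]byte(`{"error":"backend not OK: 503"}`))) // false
	fmt.Println(gatewayHealthy([]byte(`plain text containing OK`)))        // false
}
```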

Documentation Link Integrity: Multiple .md files have internal links rewritten from ../../docs/ to relative paths (e.g., docs/user-guide/configuration.md, docs/user-guide/hpa-integration.md). Verify all link targets remain reachable after reorganization; particularly, removed files (docs/tutorials/parameter-estimation.md, docs/developer-guide/agentic-workflows.md) have backlinks that should be cleaned up or redirected.

Slice Reuse Risk in internal/discovery/k8s_with_gpu_operator_test.go: The new code correctly creates a fresh slice (make([]runtime.Object, 0, len(nodes)+len(pods))) instead of reusing the nodes slice, which prevents unintended mutation. Verify similar patterns in test/benchmark/scale_up_latency_benchmark_test.go and test/chart/client_only_install_test.go have equivalent safeguards.
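
The aliasing hazard the fresh-slice pattern avoids can be shown in a few lines. This is a generic illustration with placeholder data, not code from the files named above:

```go
package main

import "fmt"

func main() {
	// Risky: two appends off the same slice with spare capacity share
	// one backing array, so the second append clobbers the first.
	base := make([]string, 2, 4)
	base[0], base[1] = "n1", "n2"
	a := append(base, "p1")
	_ = append(base, "x1")
	fmt.Println(a[2]) // x1 — a's appended element was overwritten

	// Safe pattern from the diff: a fresh slice sized up front never
	// aliases the input slice's backing array.
	objs := make([]string, 0, len(base)+1)
	objs = append(objs, base...)
	objs = append(objs, "p1")
	fmt.Println(objs) // [n1 n2 p1]
}
```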

🚥 Pre-merge checks | ✅ 2

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. Title accurately reflects that this PR syncs upstream commits into the main branch, with a specific commit hash and date. It is concise and clearly communicates the primary purpose.
  • Description check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 13

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (11)
test/utils/pod_scraping_test_helpers.go (1)

555-566: ⚠️ Potential issue | 🟠 Major

Use io.LimitReader for unbounded pod log reads — CWE-400 (Uncontrolled Resource Consumption).

Line 565 uses DoRaw(ctx) without size constraints, allowing arbitrary log payloads to exhaust heap memory. A pod emitting large logs (intentionally or through malfunction) will OOM the test process.

Proposed fix
 import (
 	"context"
 	_ "embed"
 	"fmt"
+	"io"
 	"net/http"
 	"time"
@@
-	logBytes, err := logsReq.DoRaw(ctx)
+	logStream, err := logsReq.Stream(ctx)
+	if err != nil {
+		g.Expect(err).NotTo(gom.HaveOccurred(), "Should be able to stream job logs")
+	}
+	defer func() {
+		_ = logStream.Close()
+	}()
+
+	const maxLogBytes = 1 << 20 // 1 MiB cap
+	logBytes, err := io.ReadAll(io.LimitReader(logStream, maxLogBytes))
 	g.Expect(err).NotTo(gom.HaveOccurred(), "Should be able to get job logs")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/utils/pod_scraping_test_helpers.go` around lines 555 - 566, The pod log
read using logsReq.DoRaw(ctx) is unbounded and can OOM; change to a streaming
read via logsReq.Stream(ctx), obtain the io.ReadCloser (e.g., stream), wrap it
with io.LimitReader(stream, <MAX_BYTES>) to cap the bytes (choose a sensible
MAX_BYTES like a few MBs), read from that limited reader into your logBytes, and
ensure you close the stream; update references to logsReq and logBytes
accordingly so tests still assert on the returned bytes.
test/benchmark/grafana.go (1)

80-86: ⚠️ Potential issue | 🟡 Minor

InsecureSkipVerify: true disables TLS certificate verification (CWE-295).

Per coding guidelines, InsecureSkipVerify enables MITM attacks. While this is test/benchmark code connecting to a localhost port-forward, consider adding a code comment documenting why this is acceptable here (local-only traffic through kubectl port-forward to in-cluster Grafana).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/benchmark/grafana.go` around lines 80 - 86, The code sets
TLSClientConfig.InsecureSkipVerify = true on the httpClient which disables TLS
cert verification; instead of changing behavior, add a clear inline comment next
to the TLSClientConfig/InsecureSkipVerify setting (or above the httpClient
construction) explaining that this is intentionally allowed because the
benchmark/test uses localhost kubectl port-forwarding to an in-cluster Grafana
instance (local-only traffic), and note any constraints (e.g., only for
tests/benchmarks, not production) so reviewers understand the risk is accepted
for this context.
internal/collector/source/pod/pod_scraping_source.go (1)

293-296: ⚠️ Potential issue | 🟠 Major

Missing io.LimitReader on HTTP response body — potential memory exhaustion (CWE-400).

The response body from pod metrics endpoints is read without size limits. A compromised or malicious pod could return an arbitrarily large response, causing memory exhaustion in the controller.

Proposed fix
+const maxMetricsResponseSize = 10 * 1024 * 1024 // 10MB limit for metrics response
+
 // parsePrometheusMetrics parses Prometheus text format into MetricResult.
-func (p *PodScrapingSource) parsePrometheusMetrics(reader io.Reader, podName string) (*source.MetricResult, error) {
+func (p *PodScrapingSource) parsePrometheusMetrics(reader io.Reader, podName string) (*source.MetricResult, error) {
+	limitedReader := io.LimitReader(reader, maxMetricsResponseSize)
-	parser := expfmt.NewTextParser(model.UTF8Validation)
-	metricFamilies, err := parser.TextToMetricFamilies(reader)
+	parser := expfmt.NewTextParser(model.UTF8Validation)
+	metricFamilies, err := parser.TextToMetricFamilies(limitedReader)

As per coding guidelines: "Use io.LimitReader for HTTP response bodies (prevent memory exhaustion)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/collector/source/pod/pod_scraping_source.go` around lines 293 - 296,
The code reads the HTTP response body directly into parsePrometheusMetrics (call
site: p.parsePrometheusMetrics(resp.Body, pod.Name)), which risks memory
exhaustion; wrap resp.Body with io.LimitReader using a sane max bytes constant
(e.g., maxMetricsBodyBytes) and pass that limited reader to
parsePrometheusMetrics instead, ensure the response body is still closed
(resp.Body.Close()) after reading, and optionally return a clear error if the
reader hits the limit so callers know the metrics were truncated.
test/utils/e2eutils.go (1)

209-223: ⚠️ Potential issue | 🟠 Major

Do not suppress all TLS secret creation failures.

createTLSCertificateSecret() still returns nil when kubectl create secret tls fails for reasons other than "already exists". InstallPrometheusOperator() will keep going and fail later with a much less obvious root cause.

🐛 Proposed fix
-func createTLSCertificateSecret() error { //nolint:unparam // error return kept for interface consistency
+func createTLSCertificateSecret() error {
 	certFile := "hack/tls-certs/prometheus-cert.pem"
 	keyFile := "hack/tls-certs/prometheus-key.pem"

 	cmd := exec.Command("kubectl", "create", "secret", "tls", "prometheus-tls",
 		"--cert="+certFile,
 		"--key="+keyFile,
 		"-n", monitoringNamespace)

 	if _, err := Run(cmd); err != nil {
-		// Secret might already exist, which is fine
-		fmt.Println("TLS secret already exists or creation failed (this is usually OK)")
+		if strings.Contains(err.Error(), "already exists") {
+			return nil
+		}
+		return fmt.Errorf("failed to create prometheus TLS secret: %w", err)
 	}

 	return nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/utils/e2eutils.go` around lines 209 - 223, The
createTLSCertificateSecret function currently swallows all errors from running
kubectl; change it to check the returned err from Run(cmd) and only ignore it
when it indicates the secret already exists (e.g., error text contains "already
exists"); for any other error return that error to the caller so
InstallPrometheusOperator can fail fast. Update the error handling around the
Run(cmd) call in createTLSCertificateSecret() (and keep the function signature)
to log the specific failure and return the non-"already exists" error rather
than always returning nil.
deploy/lib/llm_d_nightly_install.sh (1)

77-80: ⚠️ Potential issue | 🟡 Minor

Fixed 3-second sleep after Job delete is fragile.

The time.Sleep(3 * time.Second) after deleting the previous Job assumes deletion completes within 3 seconds. Under resource pressure, this may not hold, causing the subsequent Create to fail with "already exists".

Proposed fix: poll for Job absence
-	_ = k8sClient.BatchV1().Jobs(namespace).Delete(ctx, jobName, metav1.DeleteOptions{
-		PropagationPolicy: ptr.To(metav1.DeletePropagationBackground),
-	})
-	time.Sleep(3 * time.Second)
+	_ = k8sClient.BatchV1().Jobs(namespace).Delete(ctx, jobName, metav1.DeleteOptions{
+		PropagationPolicy: ptr.To(metav1.DeletePropagationForeground),
+	})
+	// Wait for Job to be fully deleted
+	_ = wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 30*time.Second, true, func(ctx context.Context) (bool, error) {
+		_, err := k8sClient.BatchV1().Jobs(namespace).Get(ctx, jobName, metav1.GetOptions{})
+		if err != nil && strings.Contains(err.Error(), "not found") {
+			return true, nil
+		}
+		return false, nil
+	})


🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/llm_d_nightly_install.sh` around lines 77 - 80, Replace the
fragile time.Sleep(3 * time.Second) after deleting the Job with a polling loop
that waits until the Job no longer exists before calling Create: after calling
Delete (e.g., jobClient.Delete or DeleteJob), repeatedly attempt to Get the Job
(or call jobClient.Get/GetJob) with a small backoff and timeout (e.g., check
every 500ms up to a configurable deadline) and break when a NotFound error is
returned; if the deadline elapses return or propagate an error instead of
proceeding to Create (e.g., jobClient.Create or CreateJob). Ensure you only
proceed to Create once the Get returns a NotFound to avoid the "already exists"
race.
internal/controller/configmap_helpers.go (1)

111-115: ⚠️ Potential issue | 🟠 Major

isNamespaceExcluded fails open on namespace read errors (CWE-285, CWE-754).

On API/RBAC read failure, the function returns false, effectively treating the namespace as not excluded. Exploit scenario: temporary permission/API faults can bypass exclusion controls.

Suggested fix
 import (
 	"context"
 
 	"github.com/go-logr/logr"
 	yaml "gopkg.in/yaml.v3"
 	corev1 "k8s.io/api/core/v1"
+	apierrors "k8s.io/apimachinery/pkg/api/errors"
 	"sigs.k8s.io/controller-runtime/pkg/client"
@@
 	var ns corev1.Namespace
 	if err := c.Get(ctx, client.ObjectKey{Name: namespace}, &ns); err != nil {
-		// If namespace doesn't exist or we can't read it, default to not excluded
-		// This is safe - we'll proceed with normal logic
-		return false
+		// Fail closed on unexpected read errors for exclusion checks.
+		// Namespace not found is treated as not excluded.
+		return !apierrors.IsNotFound(err)
 	}

As per coding guidelines, "K8S CONTROLLER SECURITY: Proper error handling for security-critical operations."

Also applies to: 122-123

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/configmap_helpers.go` around lines 111 - 115, The
isNamespaceExcluded function currently treats any client.Get error as "not
excluded" (failing open); change it to fail closed: when calling c.Get(ctx,
client.ObjectKey{Name: namespace}, &ns) (and the similar call at the other
site), return false only if the error is a NotFound (use
apierrors.IsNotFound(err)); for any other error log the failure with context
(namespace and error) and return true to enforce exclusion on
read/permission/API failures. Ensure you reference the function
isNamespaceExcluded and update both occurrences noted in the diff.
internal/controller/predicates.go (1)

215-225: ⚠️ Potential issue | 🟠 Major

Fail-open on namespace lookup can bypass exclusion policy (CWE-285, CWE-754).

If namespace Get fails, the predicate continues and may reconcile resources from excluded namespaces. Exploit scenario: transient API/RBAC failures let excluded namespaces pass policy checks.

Suggested fix
 		if namespace != "" {
 			var ns corev1.Namespace
 			// Use background context for predicate (no cancellation needed)
-			if err := k8sClient.Get(context.Background(), client.ObjectKey{Name: namespace}, &ns); err == nil {
-				annotations := ns.GetAnnotations()
-				if annotations != nil {
-					if value, exists := annotations[constants.NamespaceExcludeAnnotationKey]; exists && value == constants.AnnotationValueTrue {
-						// Namespace is excluded - filter out this VA
-						return false
-					}
-				}
-			}
-			// If namespace fetch fails, proceed with other checks (fail open)
+			if err := k8sClient.Get(context.Background(), client.ObjectKey{Name: namespace}, &ns); err != nil {
+				// Fail closed for exclusion policy checks
+				return false
+			}
+			annotations := ns.GetAnnotations()
+			if annotations != nil {
+				if value, exists := annotations[constants.NamespaceExcludeAnnotationKey]; exists && value == constants.AnnotationValueTrue {
+					// Namespace is excluded - filter out this VA
+					return false
+				}
+			}
 		}

As per coding guidelines, "K8S CONTROLLER SECURITY: Proper error handling for security-critical operations."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/controller/predicates.go` around lines 215 - 225, The current
predicate fails open when k8sClient.Get(...) for the namespace returns an error,
which can let excluded namespaces pass; change the logic in the predicate that
calls k8sClient.Get to treat any Get error as a failure-closed case: if
k8sClient.Get(ctx, client.ObjectKey{Name: namespace}, &ns) returns an error, log
the error and return false (exclude the resource) instead of continuing; keep
the existing annotation check using constants.NamespaceExcludeAnnotationKey and
constants.AnnotationValueTrue when Get succeeds, and explicitly handle
apierrors.IsNotFound if you want a different behavior (otherwise treat NotFound
as excluded as well).
internal/engines/saturation/engine_v2.go (1)

109-118: ⚠️ Potential issue | 🟠 Major

Reject unsupported enabled analyzers instead of silently zeroing their weight.

Any enabled analyzer whose name is not interfaces.SaturationAnalyzerName now contributes nothing to baseResult.Score, which can flatten model priority and change allocation order without any signal. Either keep explicit handling for every supported analyzer or fail fast on unknown enabled names.

Suggested fix
  totalWeighted := 0.0
  for _, aw := range config.Analyzers {
    if aw.Enabled != nil && !*aw.Enabled {
      continue
    }
-   if aw.Name == interfaces.SaturationAnalyzerName {
-     totalWeighted += baseResult.RequiredCapacity * aw.Score
-     // future: add "throughput", "slo" cases
-   }
+   switch aw.Name {
+   case interfaces.SaturationAnalyzerName:
+     totalWeighted += baseResult.RequiredCapacity * aw.Score
+   default:
+     return nil, fmt.Errorf("unsupported analyzer %q in V2 scoring", aw.Name)
+   }
  }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engines/saturation/engine_v2.go` around lines 109 - 118, The loop
over config.Analyzers currently ignores any enabled analyzer whose name !=
interfaces.SaturationAnalyzerName, which silently zeroes their contribution;
update the logic in the function containing that loop (the code that computes
the weighted score using totalWeighted, baseResult.RequiredCapacity and
config.Analyzers) to reject unknown enabled analyzer names instead of skipping
them—when aw.Enabled is true and aw.Name is not
interfaces.SaturationAnalyzerName return or surface an error (or explicitly log
and fail fast) so callers know an unsupported analyzer was configured; keep the
existing handling for interfaces.SaturationAnalyzerName and explicitly enumerate
any future supported analyzer names in the same check.
docs/user-guide/keda-integration.md (1)

236-314: ⚠️ Potential issue | 🟠 Major

Remove unsafeSsl: "true" from the primary KEDA example.

This bakes TLS verification bypass into the default path. That is CWE-295: a compromised in-cluster proxy, service, or network path can spoof Prometheus responses and drive arbitrary scale decisions. Keep the secure example as the default, and move any verification-bypass workaround into an explicitly labeled local-dev troubleshooting note.

Suggested fix
-      unsafeSsl: "true"                    # Skip SSL verification for self-signed certificates
+      unsafeSsl: "false"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user-guide/keda-integration.md` around lines 236 - 314, The KEDA example
currently includes unsafeSsl: "true" under the prometheus trigger metadata which
bypasses TLS verification; remove the unsafeSsl field from the primary
ScaledObject sample (look for the triggers -> type: prometheus block and its
metadata including serverAddress, query, threshold, activationThreshold,
metricType) so the default example enforces TLS verification, and instead add a
clearly labeled local-dev/troubleshooting note elsewhere that explains how to
temporarily set unsafeSsl for self-signed certs (do not keep it in the default
config).
docs/user-guide/configuration.md (1)

163-167: ⚠️ Potential issue | 🟠 Major

Don't claim immutable ConfigMaps prevent malicious changes.

This overstates the guarantee. immutable: true blocks updates, but anyone who still has delete/create on ConfigMaps can replace the object and change controller settings anyway. Reword this to say immutability only prevents in-place edits and must be paired with RBAC that denies delete/recreate.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user-guide/configuration.md` around lines 163 - 167, The bullet claiming
immutable ConfigMaps "Protects against malicious modifications" overstates
guarantees; update the wording to state that setting immutable: true only
prevents in-place edits (it blocks updates to the existing object) and does not
stop actors with delete/create permissions from replacing the ConfigMap, and add
a note that immutability must be paired with RBAC that denies delete/recreate to
prevent replacement; modify the "Protects against malicious modifications"
bullet and/or add a line under "Security Benefits" to reflect this limitation
and required RBAC pairing.
internal/engines/saturation/engine.go (1)

251-275: ⚠️ Potential issue | 🟠 Major

Honor EnableLimiter per effective namespace config.

enableLimiter is taken only from the global "default" saturation config, but the V2 and queueing-model paths later build requests from namespace-local configs. If one namespace enables limited mode and the global default does not, that tenant still goes through the unrestricted optimizer for the entire cycle, so GPU constraints are silently skipped.

Either make limiter selection controller-global and reject namespace overrides, or partition requests by effective limiter mode and optimize each group separately.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engines/saturation/engine.go` around lines 251 - 275, The current
code reads enableLimiter only from the global "default" config (via
e.Config.SaturationConfig() and cfg.EnableLimiter) but later uses
namespace-local configs when building V2/queueing-model requests, so limiter is
not honored per-namespace; update the optimizer-selection logic so it respects
the effective per-namespace limiter mode instead of a single global flag: either
(A) enforce a controller-global policy by rejecting or overriding namespace
EnableLimiter overrides when computing e.optimizer, or (B) partition the request
set by effective namespace EnableLimiter and run optimization separately for
each partition (creating/choosing the optimizer based on enableLimiter for each
partition) before merging results; locate helper symbols analyzerName,
enableLimiter, hasQMAnalyzerConfig, e.optimizer,
pipeline.NewGreedyByScoreOptimizer(), and pipeline.NewCostAwareOptimizer() to
implement the chosen approach.
🟡 Minor comments (11)
pkg/solver/solver_test.go-617-619 (1)

617-619: ⚠️ Potential issue | 🟡 Minor

Strengthen assertion at Line 617 to validate the actual minimum allocation.

Current condition only rejects values > 50.0; it can still pass when a non-minimum allocation is selected but happens to be <= 50.0.

Proposed fix
+	expectedMin := allocs[0].Value()
+	for _, alloc := range allocs[1:] {
+		if alloc.Value() < expectedMin {
+			expectedMin = alloc.Value()
+		}
+	}
+
 	selectedAlloc := server.Allocation()
 	if selectedAlloc == nil {
 		t.Error("Expected server to have an allocation after SolveUnlimited")
-	} else if selectedAlloc.Value() > 50.0 {
+	} else if selectedAlloc.Value() != expectedMin {
 		// The allocation with minimum value should be selected
-		t.Errorf("Expected allocation with minimum value to be selected, got value %f", selectedAlloc.Value())
+		t.Errorf("Expected allocation with minimum value %f to be selected, got %f", expectedMin, selectedAlloc.Value())
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/solver/solver_test.go` around lines 617 - 619, The test currently only
checks selectedAlloc.Value() > 50.0 which can miss non-minimal selections
<=50.0; update the test to compute the actual minimum allocation by iterating
the allocations slice (e.g., allocations or allocs), find the minimal allocation
value (expectedMin), then assert that selectedAlloc.Value() equals expectedMin
within a small epsilon (use math.Abs(selectedAlloc.Value()-expectedMin) < 1e-9)
and fail the test (t.Errorf/t.Fatalf) if it does not; reference selectedAlloc
and Value() when performing the comparison so the assertion verifies the true
minimum selection.
deploy/lib/infra_wva.sh-103-104 (1)

103-104: ⚠️ Potential issue | 🟡 Minor

Unquoted variable expansion in Helm arguments.

The $KV_CACHE_THRESHOLD and $QUEUE_LENGTH_THRESHOLD values are not quoted. If these contain spaces or shell metacharacters, command parsing could be affected.

-        ${KV_CACHE_THRESHOLD:+--set wva.capacityScaling.default.kvCacheThreshold=$KV_CACHE_THRESHOLD} \
-        ${QUEUE_LENGTH_THRESHOLD:+--set wva.capacityScaling.default.queueLengthThreshold=$QUEUE_LENGTH_THRESHOLD} \
+        ${KV_CACHE_THRESHOLD:+--set "wva.capacityScaling.default.kvCacheThreshold=$KV_CACHE_THRESHOLD"} \
+        ${QUEUE_LENGTH_THRESHOLD:+--set "wva.capacityScaling.default.queueLengthThreshold=$QUEUE_LENGTH_THRESHOLD"} \

This follows the same unquoted pattern as existing lines (101-102, 105-106), so risk is low if values are numeric. Consider quoting all such expansions for defense-in-depth. As per coding guidelines: "Quote all variables to prevent injection".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/infra_wva.sh` around lines 103 - 104, The Helm --set expansions
for KV_CACHE_THRESHOLD and QUEUE_LENGTH_THRESHOLD should be quoted to prevent
word-splitting or shell metacharacter injection; update the conditional
expansions that set wva.capacityScaling.default.kvCacheThreshold and
wva.capacityScaling.default.queueLengthThreshold to wrap the variable expansions
in double quotes (i.e., quote the ${KV_CACHE_THRESHOLD} and
${QUEUE_LENGTH_THRESHOLD} expansions) so each --set value is passed as a single,
safe argument.
docs/user-guide/hpa-integration.md-490-490 (1)

490-490: ⚠️ Potential issue | 🟡 Minor

Update the earlier in-page anchor after renaming this heading.

This heading now generates a different fragment ID, but Line 66 still links to the old ...hpa-integrationyaml anchor, so that note above no longer jumps to the HPA example.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user-guide/hpa-integration.md` at line 490, The in-page anchor pointing
to the HPA example needs updating because the heading "### HPA Configuration
Example (`config/samples/hpa/hpa.yaml`)" changed its fragment ID; find the link
referenced earlier (the note that currently targets the old anchor
`...hpa-integrationyaml`) and update its href/anchor to match the new fragment
generated from this heading (typically derived from the heading text, e.g.,
`hpa-configuration-example-configsampleshpa-hpa-yaml`), ensuring the note on
Line 66 jumps to the HPA example correctly.
deploy/lib/llm_d_nightly_install.sh-43-45 (1)

43-45: ⚠️ Potential issue | 🟡 Minor

WVA_METRICS_SECURE=false disables authentication on the metrics endpoint.

Setting WVA_METRICS_SECURE=false for OpenShift removes bearer token auth from /metrics. If this endpoint is network-accessible, metrics (including potentially sensitive operational data) become publicly readable. Verify this is intentional for the nightly environment and that network policies restrict access.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/llm_d_nightly_install.sh` around lines 43 - 45, The defaulting of
WVA_METRICS_SECURE to "false" disables auth on /metrics; change the logic so
OpenShift does not silently disable metrics auth: set WVA_METRICS_SECURE default
to "true" or only allow "false" when explicitly overridden (e.g., if [ -z
"${WVA_METRICS_SECURE+x}" ] then export WVA_METRICS_SECURE=true), add a clear
comment next to WVA_METRICS_SECURE and a runtime warning if
ENVIRONMENT=openshift and WVA_METRICS_SECURE=false, and ensure
MONITORING_NAMESPACE and ENVIRONMENT checks remain unchanged so the variable
names (WVA_METRICS_SECURE, MONITORING_NAMESPACE, ENVIRONMENT) are easy to locate
for this fix.
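A sketch of the suggested defaulting logic, assuming nothing beyond the variable names in the finding; the warning wording is illustrative, not the script's actual output.

```shell
# Default WVA_METRICS_SECURE to true unless the caller explicitly set it,
# so OpenShift never silently loses /metrics auth. Variable names come from
# the finding; the warning text is an illustrative placeholder.
if [ -z "${WVA_METRICS_SECURE+x}" ]; then
  export WVA_METRICS_SECURE=true
fi

if [ "${ENVIRONMENT:-}" = "openshift" ] && [ "${WVA_METRICS_SECURE}" = "false" ]; then
  echo "WARNING: bearer token auth on /metrics is disabled (WVA_METRICS_SECURE=false)" >&2
fi

echo "WVA_METRICS_SECURE=${WVA_METRICS_SECURE}"
```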
test/benchmark/job_waiter.go-77-80 (1)

77-80: ⚠️ Potential issue | 🟡 Minor

Race condition: fixed sleep after Job deletion.

time.Sleep(3 * time.Second) assumes the Job is deleted within 3 seconds. Under cluster load, deletion may take longer, causing Create to fail with "already exists". Use DeletePropagationForeground and poll for Job absence, or handle AlreadyExists error on create.

Proposed fix: poll for deletion completion
 	_ = k8sClient.BatchV1().Jobs(namespace).Delete(ctx, jobName, metav1.DeleteOptions{
-		PropagationPolicy: ptr.To(metav1.DeletePropagationBackground),
+		PropagationPolicy: ptr.To(metav1.DeletePropagationForeground),
 	})
-	time.Sleep(3 * time.Second)
+	// Poll until Job is gone
+	_ = wait.PollUntilContextTimeout(ctx, 500*time.Millisecond, 30*time.Second, true, func(ctx context.Context) (bool, error) {
+		_, err := k8sClient.BatchV1().Jobs(namespace).Get(ctx, jobName, metav1.GetOptions{})
+		return apierrors.IsNotFound(err), nil
+	})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/benchmark/job_waiter.go` around lines 77 - 80, The fixed 3s sleep after
calling k8sClient.BatchV1().Jobs(namespace).Delete(...) is a race; replace the
Background propagation and sleep with a deterministic wait: call Delete with
metav1.DeletePropagationForeground (replace
ptr.To(metav1.DeletePropagationBackground)), then poll using the
Jobs(namespace).Get(ctx, jobName, ...) until it returns NotFound (or timeout)
before proceeding to create, or alternatively catch and handle the AlreadyExists
error on create (Jobs(...).Create) by retrying the poll; update the logic around
the Delete call and any code that uses time.Sleep to use the polling/timeout
approach instead.
internal/engines/analyzers/saturation_v2/capacity_store.go-181-181 (1)

181-181: ⚠️ Potential issue | 🟡 Minor

Inconsistent live-source check can break priority selection.

Line 181 still uses "live" literal on best.LearnedFrom while other paths use learnedFromLive. If the constant changes, live records may stop being preferred.

Suggested fix
-		if best == nil || (best.LearnedFrom != "live" && rec.LearnedFrom == learnedFromLive) {
+		if best == nil || (best.LearnedFrom != learnedFromLive && rec.LearnedFrom == learnedFromLive) {
 			best = rec
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engines/analyzers/saturation_v2/capacity_store.go` at line 181, The
conditional uses a string literal "live" when checking best.LearnedFrom which is
inconsistent with the constant learnedFromLive used elsewhere; update the
comparison so both sides use the learnedFromLive constant (i.e., replace the
literal "live" in the if condition involving best.LearnedFrom and
rec.LearnedFrom) to ensure consistent live-source priority logic for the
best/rec selection code in capacity_store.go.
test/benchmark/suite_test.go-91-103 (1)

91-103: ⚠️ Potential issue | 🟡 Minor

Fail fast when thanos-querier has no web port.

If the service lookup succeeds but the named port is absent, this silently falls back to 9090 and the later port-forward failure points at the wrong problem. Track whether web was found and Expect it before continuing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/benchmark/suite_test.go` around lines 91 - 103, The code currently looks
up the thanos-querier Service and scans svc.Spec.Ports for a port with port.Name
== "web" but silently leaves promServicePort at its default (9090) if not found;
modify the block that uses k8sClient.CoreV1().Services(promNamespace).Get(...)
and the loop over svc.Spec.Ports to track a boolean (e.g., webPortFound), set
promServicePort when you find the "web" port, then call
Expect(webPortFound).To(BeTrue(), "thanos-querier service missing named port
'web'") (or equivalent) before continuing so the test fails fast and clearly
when the named port is absent.
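A self-contained sketch of the fail-fast lookup; the `servicePort` struct stands in for `corev1.ServicePort`, and the returned `bool` is what the suite would assert with Gomega's `Expect` before continuing.

```go
package main

import "fmt"

// servicePort stands in for the relevant fields of corev1.ServicePort.
type servicePort struct {
	Name string
	Port int32
}

// findWebPort returns the "web" port plus an explicit found flag, instead of
// silently leaving the caller on a 9090 default when the port is absent.
func findWebPort(ports []servicePort) (int32, bool) {
	for _, p := range ports {
		if p.Name == "web" {
			return p.Port, true
		}
	}
	return 0, false
}

func main() {
	ports := []servicePort{{Name: "metrics", Port: 9091}}
	if port, ok := findWebPort(ports); ok {
		fmt.Println("web port:", port)
	} else {
		// In the suite this would be Expect(ok).To(BeTrue(), ...) so the
		// failure names the real problem, not a later port-forward error.
		fmt.Println("thanos-querier service missing named port 'web'")
	}
}
```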
docs/user-guide/keda-integration.md-44-44 (1)

44-44: ⚠️ Potential issue | 🟡 Minor

Fix the broken intra-doc anchor.

The fragment on Line 44 does not resolve to the heading on Line 236, so the link at the top of the guide is dead. Use a simple heading/fragment pair or add an explicit HTML anchor.

Also applies to: 236-236

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user-guide/keda-integration.md` at line 44, The intra-doc anchor used in
the link "at the end of this
doc](`#keda-scaledobject-configuration-example-configsampleskeda-scaledobjectyaml`)"
is broken; update the target to match the actual heading slug or add an explicit
HTML anchor above the target heading. Locate the referenced end-of-doc heading
(the heading around line 236, e.g. "KEDA ScaledObject configuration example —
config/samples/keda/scaledobject.yaml") and either change the link to the real
fragment generated by the heading text or insert an HTML anchor like <a
id="keda-scaledobject-configuration-example-configsampleskeda-scaledobjectyaml"></a>
immediately before that heading so the top-of-doc link resolves correctly.
docs/user-guide/saturation-analyzer.md-243-268 (1)

243-268: ⚠️ Potential issue | 🟡 Minor

Fenced code block missing language identifier.

The ASCII diagram code block (line 243) lacks a language specifier. Add text or plaintext to satisfy linters and improve rendering consistency.

Proposed fix
-```
+```text
 ┌─────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user-guide/saturation-analyzer.md` around lines 243 - 268, The fenced
ASCII diagram block beginning with "┌─────────────┐" should include a language
identifier to satisfy linters and ensure consistent rendering; update the
opening fence from ``` to ```text (or ```plaintext) so the block becomes a
labeled plaintext code block.
docs/user-guide/saturation-analyzer.md-27-27 (1)

27-27: ⚠️ Potential issue | 🟡 Minor

Missing heading for Table of Contents entry.

Line 27 references #multi-variant-analysis in the TOC, but no corresponding heading exists in the document. Either remove this TOC entry or add the missing ## Multi-Variant Analysis heading.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/user-guide/saturation-analyzer.md` at line 27, The TOC contains a link
to `#multi-variant-analysis` but there is no matching heading; fix by adding a
document heading that exactly matches the anchor (e.g., add a line "##
Multi-Variant Analysis" where the section content should go) or remove the TOC
entry; ensure the heading text "Multi-Variant Analysis" matches the TOC anchor
so the link resolves.
docs/developer-guide/BENCHMARK_PHASE_1_AND_2.md-42-46 (1)

42-46: ⚠️ Potential issue | 🟡 Minor

Drop the hard-coded gh pr checkout 900 step.

This locks the doc to a historical PR and will put readers on the wrong branch once that PR is closed or no longer matches the current benchmark code. The instructions should tell users to check out the branch or commit they actually want to benchmark.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/developer-guide/BENCHMARK_PHASE_1_AND_2.md` around lines 42 - 46, Remove
the hard-coded git command "gh pr checkout 900" and replace it with a generic
instruction telling users to checkout the specific branch or commit they intend
to benchmark (e.g., "git checkout <branch-or-commit>" or "gh pr checkout
<PR-number> if applicable"); update the step text around the unique string "gh
pr checkout 900" so it advises users to verify and select the correct
branch/commit for the benchmark instead of using a fixed PR number.
🧹 Nitpick comments (7)
internal/config/config_test.go (1)

69-69: Drop the ignored goroutine parameter entirely.

Passing i while discarding it (_ int) adds noise and suggests a captured-loop-index concern that no longer exists.

Suggested cleanup
-		go func(_ int) {
+		go func() {
 			defer wg.Done()
 			for j := 0; j < iterations; j++ {
@@
-		}(i)
+		}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/config_test.go` at line 69, Remove the unused goroutine
parameter: change the literal "go func(_ int) {" to a parameterless "go func()
{" and remove the corresponding argument where it's invoked (the trailing "}(i)"
should become "}") so the goroutine no longer declares or passes an ignored loop
index; update the closure invocation accordingly in
internal/config/config_test.go where "go func(_ int) {" and "}(i)" appear.
internal/utils/scaletarget/fetch_test.go (1)

836-838: Make errorType assertion handling exhaustive to prevent false-positive tests.

Line 836 currently validates only "NotFound". Any new non-empty errorType value will bypass type-specific checks and may let regressions pass unnoticed.

Proposed diff
-				if tt.errorType == "NotFound" {
-					assert.True(t, apierrors.IsNotFound(err), "expected NotFound error")
-				}
+				switch tt.errorType {
+				case "NotFound":
+					assert.True(t, apierrors.IsNotFound(err), "expected NotFound error")
+				case "":
+					// no type-specific assertion
+				default:
+					t.Fatalf("unhandled errorType in test case: %q", tt.errorType)
+				}
As per coding guidelines, this addresses review priority "Bug-prone patterns and error handling gaps".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/utils/scaletarget/fetch_test.go` around lines 836 - 838, The test
only checks tt.errorType == "NotFound" and ignores any other non-empty
errorType, which can mask regressions; update the assertion logic around
tt.errorType in the test (the block that currently calls
apierrors.IsNotFound(err)) to be exhaustive: if tt.errorType == "NotFound"
assert apierrors.IsNotFound(err), else if tt.errorType == "" assert no error (or
use assert.NoError), and otherwise fail the test (t.Fatalf or assert.Failf) with
a clear message about the unexpected tt.errorType value so any new errorType
values must be handled explicitly.
internal/engines/pipeline/cost_aware_optimizer.go (1)

109-109: Consider applying the same pattern to costAwareScaleDown for consistency.

The augmented assignment operator is now used here, but line 188 in costAwareScaleDown still uses the explicit form. For consistent style:

Consistency fix for scale-down
-    targets[vc.VariantName] = current - replicasToRemove
+    targets[vc.VariantName] -= replicasToRemove
     remaining -= float64(replicasToRemove) * vc.PerReplicaCapacity
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/engines/pipeline/cost_aware_optimizer.go` at line 109, The style is
inconsistent: you used the augmented assignment operator in the scale-up code
(targets[vc.VariantName] += replicasNeeded) but the costAwareScaleDown function
still uses the explicit form; update costAwareScaleDown to use the corresponding
compound assignment (e.g., targets[vc.VariantName] -= <replicasToRemove
variable>) when decreasing targets so it matches the pattern and references the
same targets and vc.VariantName identifiers.
internal/config/helpers.go (1)

104-104: Resolve the unresolved TODO in the config source path.

Line 104 leaves runtime behavior validation as a TODO without a tracked issue/owner. Please convert this into a tracked issue (or implement/remove it) to avoid silent drift in config behavior.

I can draft the issue text (scope, acceptance criteria, and test checks) if you want me to open one.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/helpers.go` at line 104, Replace the stray TODO comment
"TODO: check setting QUEUEING_MODEL_CONFIG_MAP_NAME" that sits next to the
`return name` line with an actionable change: either implement the
env-var/config override logic (read QUEUEING_MODEL_CONFIG_MAP_NAME and use it to
compute/validate the returned name in the helper function) or create a tracked
issue and remove the TODO. If you implement it, update the helper to prefer
QUEUEING_MODEL_CONFIG_MAP_NAME when present, add unit tests that cover default
and overridden behavior, and add a short comment describing the decision; if you
create an issue, include scope, acceptance criteria, and test checks in the
issue and replace the TODO with a one-line reference to that issue ID. Ensure
the change touches the code that currently returns `name` and the TODO text so
future readers see either implemented behavior or a tracked task.
config/samples/hpa/kustomization.yaml (1)

3-4: Non-standard metadata block in Kustomization.

The kustomize.config.k8s.io/v1beta1 Kustomization kind does not use a metadata section. While kustomize will ignore it, this deviates from the standard schema. Consider removing it for cleanliness:

Suggested fix
 apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
-metadata:
-  name: hpa-sample
 resources:
 - va.yaml
 - hpa.yaml
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@config/samples/hpa/kustomization.yaml` around lines 3 - 4, Remove the
non-standard metadata block from the Kustomization by deleting the top-level
"metadata" mapping (the "metadata:" key and its "name: hpa-sample" entry) so the
file contains only valid Kustomization fields (e.g., apiVersion, kind,
resources, etc.); locate the Kustomization that currently includes the
"metadata" block and remove that block to conform to the
kustomize.config.k8s.io/v1beta1 Kustomization schema.
test/benchmark/job_waiter.go (1)

29-34: Use typed constants for condition status comparison.

Comparing cond.Status to string "True" works but is less idiomatic than using corev1.ConditionTrue.

Suggested change
-		if cond.Type == batchv1.JobComplete && cond.Status == "True" {
+		if cond.Type == batchv1.JobComplete && cond.Status == corev1.ConditionTrue {
 			return true, nil
 		}
-		if cond.Type == batchv1.JobFailed && cond.Status == "True" {
+		if cond.Type == batchv1.JobFailed && cond.Status == corev1.ConditionTrue {
 			return false, fmt.Errorf("job failed: %s", cond.Message)
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/benchmark/job_waiter.go` around lines 29 - 34, The condition checks in
job_waiter.go compare cond.Status to the literal "True"; update those
comparisons to use the typed constant corev1.ConditionTrue instead (replace
occurrences where cond.Status == "True" in the branches handling
batchv1.JobComplete and batchv1.JobFailed), ensuring you import corev1 if not
already imported and preserve the existing return values and fmt.Errorf call.
test/benchmark/workload.go (1)

22-22: Hardcoded image version may become stale.

Image ghcr.io/vllm-project/guidellm:v0.5.4 is hardcoded. Consider parameterizing or documenting the version update policy for benchmark tooling.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/benchmark/workload.go` at line 22, The hardcoded Docker image string
assigned to the image variable ("ghcr.io/vllm-project/guidellm:v0.5.4") in
workload.go can become stale; change it to be configurable (e.g., read from an
environment variable, CLI flag, or config struct) and fall back to a sensible
default, or add a clear comment/doc explaining the version update policy; update
references to image where it's used so they consume the new configurable source
(look for the image variable in workload.go and any functions that pass it
along).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Pro

Run ID: e7cf4881-0b1e-4bc3-8bca-1cf711441107

📥 Commits

Reviewing files that changed from the base of the PR and between 7cb1465 and abe1f68.

⛔ Files ignored due to path filters (5)
  • docs/developer-guide/benchmark-panels/panel-desired-replicas.png is excluded by !**/*.png
  • docs/developer-guide/benchmark-panels/panel-kv-cache.png is excluded by !**/*.png
  • docs/developer-guide/benchmark-panels/panel-queue-depth.png is excluded by !**/*.png
  • docs/developer-guide/benchmark-panels/panel-replicas.png is excluded by !**/*.png
  • docs/developer-guide/benchmark-panels/panel-saturation.png is excluded by !**/*.png
📒 Files selected for processing (130)
  • .github/workflows/ci-benchmark.yaml
  • .github/workflows/nightly-e2e-cks.yaml
  • .gitignore
  • .golangci.yml
  • CONTRIBUTING.md
  • Makefile
  • README.md
  • api/v1alpha1/variantautoscaling_types.go
  • charts/workload-variant-autoscaler/README.md
  • cmd/main.go
  • config/samples/hpa/hpa.yaml
  • config/samples/hpa/kustomization.yaml
  • config/samples/hpa/va.yaml
  • config/samples/keda/kustomization.yaml
  • config/samples/keda/scaledobject.yaml
  • config/samples/keda/va.yaml
  • deploy/README.md
  • deploy/install.sh
  • deploy/lib/infra_llmd.sh
  • deploy/lib/infra_wva.sh
  • deploy/lib/llm_d_nightly_install.sh
  • docs/CHANGELOG-v0.5.0.md
  • docs/README.md
  • docs/developer-guide/BENCHMARK_PHASE_1_AND_2.md
  • docs/developer-guide/agentic-workflows.md
  • docs/developer-guide/benchmark-summary.md
  • docs/developer-guide/testing.md
  • docs/tutorials/demo.md
  • docs/tutorials/guidellm-sample.md
  • docs/tutorials/parameter-estimation.md
  • docs/tutorials/vllm-samples.md
  • docs/user-guide/LeaderWorkerSet-support.md
  • docs/user-guide/configuration.md
  • docs/user-guide/hpa-integration.md
  • docs/user-guide/installation.md
  • docs/user-guide/keda-integration.md
  • docs/user-guide/multi-controller-isolation.md
  • docs/user-guide/saturation-analyzer.md
  • docs/user-guide/scale-from-zero.md
  • internal/actuator/actuator.go
  • internal/actuator/actuator_test.go
  • internal/actuator/suite_test.go
  • internal/collector/collector.go
  • internal/collector/source/pod/pod_scraping_source.go
  • internal/collector/source/pod_va_mapper_test.go
  • internal/collector/source/query_template.go
  • internal/config/config.go
  • internal/config/config_test.go
  • internal/config/helpers.go
  • internal/config/loader.go
  • internal/config/saturation_scaling.go
  • internal/config/scale_to_zero.go
  • internal/config/validation.go
  • internal/constants/labels.go
  • internal/controller/configmap_bootstrap.go
  • internal/controller/configmap_bootstrap_test.go
  • internal/controller/configmap_helpers.go
  • internal/controller/configmap_reconciler.go
  • internal/controller/configmap_reconciler_test.go
  • internal/controller/indexers/indexers_test.go
  • internal/controller/indexers/suite_test.go
  • internal/controller/predicates.go
  • internal/controller/suite_test.go
  • internal/controller/variantautoscaling_controller.go
  • internal/discovery/k8s_with_gpu_operator_test.go
  • internal/engines/analyzers/queueingmodel/analyzer.go
  • internal/engines/analyzers/queueingmodel/tuner/defaults.go
  • internal/engines/analyzers/queueingmodel/tuner/stasher.go
  • internal/engines/analyzers/queueingmodel/tuner/tuner.go
  • internal/engines/analyzers/queueingmodel/tuner/utils.go
  • internal/engines/analyzers/saturation_v2/analyzer.go
  • internal/engines/analyzers/saturation_v2/capacity_store.go
  • internal/engines/analyzers/saturation_v2/types.go
  • internal/engines/pipeline/cost_aware_optimizer.go
  • internal/engines/pipeline/default_limiter_test.go
  • internal/engines/pipeline/enforcer.go
  • internal/engines/pipeline/greedy_score_optimizer.go
  • internal/engines/pipeline/greedy_score_optimizer_test.go
  • internal/engines/pipeline/type_inventory.go
  • internal/engines/saturation/engine.go
  • internal/engines/saturation/engine_test.go
  • internal/engines/saturation/engine_v2.go
  • internal/engines/saturation/suite_test.go
  • internal/engines/scalefromzero/engine.go
  • internal/engines/scalefromzero/engine_test.go
  • internal/interfaces/saturation_analyzer.go
  • internal/metrics/metrics.go
  • internal/saturation/analyzer.go
  • internal/utils/scaletarget/fetch_test.go
  • internal/utils/tls.go
  • internal/utils/utils.go
  • internal/utils/utils_test.go
  • pkg/analyzer/mm1kmodel.go
  • pkg/analyzer/mm1modelstatedependent.go
  • pkg/analyzer/queueanalyzer.go
  • pkg/config/defaults.go
  • pkg/core/allocation.go
  • pkg/core/allocation_test.go
  • pkg/core/server_test.go
  • pkg/core/system_test.go
  • pkg/solver/greedy.go
  • pkg/solver/greedy_test.go
  • pkg/solver/optimizer.go
  • pkg/solver/solver_test.go
  • test/benchmark/config.go
  • test/benchmark/epp_config.go
  • test/benchmark/grafana.go
  • test/benchmark/hpa_helpers.go
  • test/benchmark/job_waiter.go
  • test/benchmark/prefill_heavy_benchmark_test.go
  • test/benchmark/scale_up_latency_benchmark_test.go
  • test/benchmark/setup_test.go
  • test/benchmark/suite_test.go
  • test/benchmark/workload.go
  • test/chart/client_only_install_test.go
  • test/e2e/fixtures/hpa_builder.go
  • test/e2e/fixtures/model_service_builder.go
  • test/e2e/fixtures/scaled_object_builder.go
  • test/e2e/fixtures/workload_builder.go
  • test/e2e/limiter_test.go
  • test/e2e/pod_scraping_test.go
  • test/e2e/saturation_analyzer_path_test.go
  • test/e2e/scale_from_zero_test.go
  • test/e2e/smoke_test.go
  • test/e2e/suite_test.go
  • test/utils/debug_helpers.go
  • test/utils/e2eutils.go
  • test/utils/feature_gates.go
  • test/utils/pod_scraping_test_helpers.go
  • test/utils/resources/llmdsim.go
💤 Files with no reviewable changes (8)
  • internal/engines/pipeline/enforcer.go
  • docs/tutorials/guidellm-sample.md
  • docs/tutorials/vllm-samples.md
  • docs/developer-guide/agentic-workflows.md
  • docs/tutorials/demo.md
  • docs/tutorials/parameter-estimation.md
  • .github/workflows/nightly-e2e-cks.yaml
  • internal/utils/utils_test.go

Comment on lines +164 to +170
  build-image:
    needs: gate
    if: needs.gate.outputs.run_benchmark == 'true' && (needs.gate.outputs.platform == 'openshift' || github.event.inputs.platform == 'openshift')
    runs-on: ubuntu-latest
    outputs:
      image_tag: ${{ steps.build.outputs.image_tag }}
    steps:

⚠️ Potential issue | 🟠 Major

Add explicit job-level token permissions to build-image.

This new job inherits the repository default GITHUB_TOKEN scope. On a job that also handles GHCR credentials, that's unnecessary privilege expansion (CWE-250/CWE-272) if any later step is compromised.

Suggested fix
   build-image:
     needs: gate
+    permissions:
+      contents: read
     if: needs.gate.outputs.run_benchmark == 'true' && (needs.gate.outputs.platform == 'openshift' || github.event.inputs.platform == 'openshift')
     runs-on: ubuntu-latest

As per coding guidelines, "Set least-privilege permissions per job, not workflow level".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci-benchmark.yaml around lines 164 - 170, The build-image
job currently inherits full repository token scopes; restrict GITHUB_TOKEN by
adding a job-level permissions mapping under the build-image job (referencing
the job name build-image) to grant only the least privileges required for the
steps that interact with GHCR and the repo — e.g. set permissions.contents: read
and permissions.packages: write (and any other specific minimal permission your
steps need) so the job no longer uses the default elevated token; place this
permissions block directly under the build-image job definition alongside
needs/runs-on/outputs/steps.

Comment on lines +171 to +177
      - name: Checkout source
        uses: actions/checkout@v4
        with:
          ref: ${{ needs.gate.outputs.pr_head_sha }}

      - name: Log in to GHCR
        uses: docker/login-action@v3

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

find . -name "ci-benchmark.yaml" -o -name "ci-benchmark.yml"

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 123


🏁 Script executed:

wc -l ./.github/workflows/ci-benchmark.yaml

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 128


🏁 Script executed:

sed -n '171,177p' ./.github/workflows/ci-benchmark.yaml
sed -n '473,476p' ./.github/workflows/ci-benchmark.yaml
sed -n '489,499p' ./.github/workflows/ci-benchmark.yaml
sed -n '978,993p' ./.github/workflows/ci-benchmark.yaml

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 1440


🏁 Script executed:

rg 'uses:' ./.github/workflows/ci-benchmark.yaml

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 718


Pin all GitHub Actions by full commit SHA.

This workflow uses mutable version tags (@v3, @v4, @v6, @v7) for all actions. Supply chain attacks can retarget these tags to inject malicious code with access to GHCR secrets, PR write permissions, and cluster credentials (CWE-494).

Replace version tags with full commit SHAs:

  • actions/checkout@v4 → commit SHA
  • docker/login-action@v3 → commit SHA
  • actions/github-script@v7 → commit SHA
  • actions/setup-go@v6 → commit SHA
  • actions/upload-artifact@v4 → commit SHA
  • docker/setup-buildx-action@v3 → commit SHA

Applies to lines 171-177, 473-476, 489-499, 978-993 and all other action invocations in this workflow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci-benchmark.yaml around lines 171 - 177, The workflow
uses mutable action tags (e.g., actions/checkout@v4, docker/login-action@v3,
actions/github-script@v7, actions/setup-go@v6, actions/upload-artifact@v4,
docker/setup-buildx-action@v3); replace each of these version tags with the
corresponding full commit SHA to pin the action to an immutable revision across
the entire file (not just lines shown). Locate every invocation of those action
names (e.g., the steps named "Checkout source", "Log in to GHCR", any steps
using actions/github-script, actions/setup-go, actions/upload-artifact,
docker/setup-buildx-action) and update the ref to the exact commit SHA from the
official action repository release commit. Ensure all occurrences are updated
consistently and verify the SHAs are correct by cross-checking the upstream
action repos before committing.
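The pinned form looks like the fragment below; `<full-commit-sha>` is a placeholder that must be resolved against the upstream action's release commit, and keeping the tag as a trailing comment preserves readability.

```yaml
- name: Checkout source
  # Pin to an immutable commit; <full-commit-sha> is a placeholder to be
  # replaced with the verified SHA from the actions/checkout release.
  uses: actions/checkout@<full-commit-sha> # v4
  with:
    ref: ${{ needs.gate.outputs.pr_head_sha }}
```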

Comment on lines +501 to +513
      - name: Install tools (kubectl, oc, helm, make)
        run: |
          sudo apt-get update && sudo apt-get install -y make
          KUBECTL_VERSION="v1.31.0"
          curl -fsSL --retry 3 --retry-delay 5 -o kubectl "https://dl.k8s.io/release/${KUBECTL_VERSION}/bin/linux/amd64/kubectl"
          chmod +x kubectl
          sudo mv kubectl /usr/local/bin/
          curl -fsSL --retry 3 --retry-delay 5 -O "https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz"
          tar -xzf openshift-client-linux.tar.gz
          sudo mv oc /usr/local/bin/
          rm -f openshift-client-linux.tar.gz kubectl README.md
          curl -fsSL --retry 3 --retry-delay 5 https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash


⚠️ Potential issue | 🔴 Critical

Replace curl | bash Helm installation on the self-hosted runner.

Downloading and executing a remote script from main is a direct supply-chain/RCE path (CWE-494). In this job it runs before cluster operations and after repository checkout, so a compromised upstream script gets full access to the runner and benchmark credentials.

Suggested fix
-          curl -fsSL --retry 3 --retry-delay 5 https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
+          HELM_VERSION="v3.17.0"
+          curl -fsSLo helm.tar.gz "https://get.helm.sh/helm-${HELM_VERSION}-linux-amd64.tar.gz"
+          echo "<published-sha256>  helm.tar.gz" | sha256sum -c -
+          tar -xzf helm.tar.gz linux-amd64/helm
+          sudo install linux-amd64/helm /usr/local/bin/helm
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci-benchmark.yaml around lines 501 - 513, The Helm install
step uses an unsafe curl | bash pipeline (the line invoking
"https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash");
replace it with a deterministic, verifiable installation: download a specific
Helm release archive from the official GitHub releases (e.g., a pinned tag),
verify the downloaded archive against the accompanying SHA256 checksum or GPG
signature, extract the binary and move it to /usr/local/bin; update the step
labeled "Install tools (kubectl, oc, helm, make)" to perform these explicit
download + checksum verification + extraction actions instead of piping remote
script to bash.

Comment on lines +606 to +628
          # Start APIService guard: KEDA on this cluster continuously reclaims the
          # external.metrics.k8s.io APIService. This background loop re-patches it
          # every 8 seconds so the HPA can read wva_desired_replicas during the benchmark.
          # Key fix: caBundle must be set to null because KEDA sets it, and Kubernetes
          # rejects insecureSkipTLSVerify=true when caBundle is present.
          MONITORING_NS="openshift-user-workload-monitoring"
          (
            while true; do
              sleep 8
              current_svc=$(kubectl get apiservice v1beta1.external.metrics.k8s.io -o jsonpath='{.spec.service.name}' 2>/dev/null)
              current_ns=$(kubectl get apiservice v1beta1.external.metrics.k8s.io -o jsonpath='{.spec.service.namespace}' 2>/dev/null)
              if [ "$current_svc" != "prometheus-adapter" ] || [ "$current_ns" != "$MONITORING_NS" ]; then
                echo "[apiservice-guard] KEDA reclaimed (now: $current_svc/$current_ns), re-patching..."
                kubectl patch apiservice v1beta1.external.metrics.k8s.io --type=merge -p "{
                  \"spec\": {
                    \"caBundle\": null,
                    \"insecureSkipTLSVerify\": true,
                    \"service\": {
                      \"name\": \"prometheus-adapter\",
                      \"namespace\": \"$MONITORING_NS\"
                    }
                  }
                }" 2>&1 || true

⚠️ Potential issue | 🔴 Critical

Don't disable TLS verification on the cluster-wide external metrics APIService.

This rewires external.metrics.k8s.io to an endpoint with insecureSkipTLSVerify: true, which turns off certificate validation for a shared cluster API surface (CWE-295). Anything able to impersonate or intercept that service can spoof metrics and drive HPA decisions while the benchmark is running.

Use a real CA bundle from the adapter service certificate instead of nulling caBundle and forcing insecure TLS.

Suggested fix
-                    "caBundle": null,
-                    "insecureSkipTLSVerify": true,
+                    "caBundle": "$PROM_ADAPTER_CA_BUNDLE_B64",
+                    "insecureSkipTLSVerify": false,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/ci-benchmark.yaml around lines 606 - 628, The patch loop
currently forces insecureSkipTLSVerify: true and clears caBundle for the
apiservice v1beta1.external.metrics.k8s.io (in the kubectl patch call inside the
while loop using MONITORING_NS and service prometheus-adapter), which disables
TLS verification; instead, retrieve the adapter's serving CA bundle (e.g., from
the prometheus-adapter TLS secret or the Service/Endpoints certificate chain),
base64-encode that CA data, and set "spec.caBundle" to that encoded value in the
kubectl patch payload while removing or setting "spec.insecureSkipTLSVerify" to
false; keep the same apiservice name and service fields (prometheus-adapter and
MONITORING_NS) but replace the null-ca/insecureSkipTLSVerify logic with code
that fetches the real CA and applies it in the patch.

Comment thread deploy/lib/infra_llmd.sh
Comment on lines +104 to +109
# Override llm-d container image tags if set (e.g. upgrade from v0.3.0 to v0.6.0)
if [ -n "$LLMD_IMAGE_TAG" ]; then
log_info "Overriding llm-d image tags to $LLMD_IMAGE_TAG"
yq eval ".decode.containers[0].image = \"ghcr.io/llm-d/llm-d-cuda:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
yq eval ".routing.proxy.image = \"ghcr.io/llm-d/llm-d-routing-sidecar:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
fi

⚠️ Potential issue | 🟠 Major

LLMD_IMAGE_TAG is rewriting repositories, not just tags.

Setting a tag override here hardcodes ghcr.io/llm-d/... for decode and routing, so any mirrored registry or alternative base image in LLM_D_MODELSERVICE_VALUES gets silently replaced. It also leaves the prefill image on the old version, which can roll out mixed llm-d components.

🐛 Proposed fix
-      yq eval ".decode.containers[0].image = \"ghcr.io/llm-d/llm-d-cuda:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
-      yq eval ".routing.proxy.image = \"ghcr.io/llm-d/llm-d-routing-sidecar:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
+      yq eval '
+        .decode.containers[0].image |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG)) |
+        (.prefill.containers[0].image? |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG))) |
+        .routing.proxy.image |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG))
+      ' -i "$LLM_D_MODELSERVICE_VALUES"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-      # Override llm-d container image tags if set (e.g. upgrade from v0.3.0 to v0.6.0)
       if [ -n "$LLMD_IMAGE_TAG" ]; then
         log_info "Overriding llm-d image tags to $LLMD_IMAGE_TAG"
-        yq eval ".decode.containers[0].image = \"ghcr.io/llm-d/llm-d-cuda:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
-        yq eval ".routing.proxy.image = \"ghcr.io/llm-d/llm-d-routing-sidecar:${LLMD_IMAGE_TAG}\"" -i "$LLM_D_MODELSERVICE_VALUES"
+        yq eval '
+          .decode.containers[0].image |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG)) |
+          (.prefill.containers[0].image? |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG))) |
+          .routing.proxy.image |= sub(":[^:@]+$"; ":" + strenv(LLMD_IMAGE_TAG))
+        ' -i "$LLM_D_MODELSERVICE_VALUES"
       fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/lib/infra_llmd.sh` around lines 104 - 109, The override currently
hardcodes the ghcr.io/llm-d/... repositories by replacing the entire image
string when LLMD_IMAGE_TAG is set; change it to only replace the image tag
portion for decode and routing (and also update the prefill image) so mirrored
registries or custom repos are preserved. Locate the yq calls that set images in
LLM_D_MODELSERVICE_VALUES (the lines referencing LLMD_IMAGE_TAG and yq eval for
.decode.containers[0].image and .routing.proxy.image), and modify them to update
just the part after the ":" (or append ":${LLMD_IMAGE_TAG}" if no tag exists)
rather than overwriting the full repository; do the same for the prefill image
field to ensure all llm-d components use the same tag.

Comment on lines +73 to +77
if strings.Contains(cm.Data[configKey], "flowControl") {
return nil
}

cm.Data[configKey] = DesiredEPPConfig

⚠️ Potential issue | 🟠 Major

Don't use a flowControl substring as the idempotence check.

This returns early whenever the current config mentions flowControl, even if the scorer weights differ from DesiredEPPConfig. The benchmark can then report the 2/2/3 policy while actually running with a different EPP configuration.

Suggested fix
-	if strings.Contains(cm.Data[configKey], "flowControl") {
+	if cm.Data[configKey] == DesiredEPPConfig {
 		return nil
 	}
 
 	cm.Data[configKey] = DesiredEPPConfig
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-	if strings.Contains(cm.Data[configKey], "flowControl") {
+	if cm.Data[configKey] == DesiredEPPConfig {
 		return nil
 	}
 
 	cm.Data[configKey] = DesiredEPPConfig
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/benchmark/epp_config.go` around lines 73 - 77, The early return uses a
substring check on cm.Data[configKey] for "flowControl" which is too loose;
replace this with a proper idempotence check by comparing the current config to
DesiredEPPConfig (either exact string equality of cm.Data[configKey] ==
DesiredEPPConfig or, preferably, unmarshal cm.Data[configKey] and
DesiredEPPConfig and compare the relevant fields such as scorer weights) so you
only skip writing when the configurations truly match; update the logic around
configKey and DesiredEPPConfig accordingly.

Comment on lines +82 to +92
curlCmd := fmt.Sprintf(
`echo "=== Connectivity check to %s ===" && `+
`echo "Attempting request with model=%s ..." && `+
`HTTP_CODE=$(curl -v -sS -o /tmp/body.txt -w "%%{http_code}" -m 60 -X POST "%s/v1/completions" `+
`-H "Content-Type: application/json" `+
`-d '{"model":"%s","prompt":"Hello","max_tokens":5}' 2>/tmp/curl_debug.txt) && `+
`echo "HTTP_CODE=$HTTP_CODE" && echo "--- Response Body ---" && cat /tmp/body.txt && echo "" && `+
`echo "--- Curl Debug (last 20 lines) ---" && tail -20 /tmp/curl_debug.txt && `+
`if [ "$HTTP_CODE" -ge 200 ] && [ "$HTTP_CODE" -lt 300 ]; then echo "OK"; else echo "FAIL: HTTP $HTTP_CODE"; exit 1; fi`,
gatewayURL, modelID, gatewayURL, modelID,
)

⚠️ Potential issue | 🟠 Major

Shell command injection via unsanitized gatewayURL and modelID (CWE-78).

User-controlled gatewayURL and modelID are interpolated directly into a shell command string via fmt.Sprintf. If either contains shell metacharacters (e.g., "; malicious_cmd;"), arbitrary commands execute in the Job container.

While this is test code and inputs likely come from trusted sources, the pattern is dangerous if reused. Sanitize inputs or pass them as environment variables.

Proposed fix: use environment variables instead of interpolation
-	curlCmd := fmt.Sprintf(
-		`echo "=== Connectivity check to %s ===" && `+
-		// ... interpolated command
-		gatewayURL, modelID, gatewayURL, modelID,
-	)
+	curlCmd := `echo "=== Connectivity check to $GATEWAY_URL ===" && ` +
+		`echo "Attempting request with model=$MODEL_ID ..." && ` +
+		`HTTP_CODE=$(curl -v -sS -o /tmp/body.txt -w "%{http_code}" -m 60 -X POST "$GATEWAY_URL/v1/completions" ` +
+		`-H "Content-Type: application/json" ` +
+		`-d "{\"model\":\"$MODEL_ID\",\"prompt\":\"Hello\",\"max_tokens\":5}" 2>/tmp/curl_debug.txt) && ` +
+		// ... rest of command using $GATEWAY_URL and $MODEL_ID

Then add environment variables to the container spec:

Env: []corev1.EnvVar{
    {Name: "GATEWAY_URL", Value: gatewayURL},
    {Name: "MODEL_ID", Value: modelID},
},
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/benchmark/job_waiter.go` around lines 82 - 92, The curlCmd currently
injects gatewayURL and modelID directly into a shell string (via curlCmd) which
risks shell injection; instead, stop interpolating those values and pass them
into the Job container as environment variables (e.g., GATEWAY_URL and MODEL_ID)
and update curlCmd to reference those env vars (e.g., "$GATEWAY_URL" and
"$MODEL_ID") or use a here-doc that reads from env, ensuring gatewayURL and
modelID are set in the container spec Env slice (Env:
[]corev1.EnvVar{{Name:"GATEWAY_URL",Value:gatewayURL},{Name:"MODEL_ID",Value:modelID}})
and remove direct fmt.Sprintf interpolation of gatewayURL/modelID in the curlCmd
variable.

Comment on lines +390 to +399
By("Patching EPP ConfigMap with flowControl + scorer weights (queue=2, kv-cache=2, prefix-cache=3)")
eppDeployName, findErr := FindEPPDeployment(ctx, k8sClient, benchCfg.LLMDNamespace)
Expect(findErr).NotTo(HaveOccurred(), "Failed to find EPP deployment")
patchErr := PatchEPPConfigMap(ctx, k8sClient, benchCfg.LLMDNamespace, eppDeployName)
if patchErr != nil {
GinkgoWriter.Printf("WARNING: EPP ConfigMap patch failed (non-fatal): %v\n", patchErr)
} else {
GinkgoWriter.Println("EPP ConfigMap patched successfully — flowControl enabled, weights 2/2/3")
eppPatched = true
}

⚠️ Potential issue | 🟠 Major

Fail fast when the EPP config patch does not apply.

This benchmark is explicitly measuring the flow-control path, but patch failures are only logged as warnings. That lets the run continue and publish artifacts as a "prefill-heavy" result even when the required EPP policy was never enabled.

Suggested fix
 		patchErr := PatchEPPConfigMap(ctx, k8sClient, benchCfg.LLMDNamespace, eppDeployName)
-		if patchErr != nil {
-			GinkgoWriter.Printf("WARNING: EPP ConfigMap patch failed (non-fatal): %v\n", patchErr)
-		} else {
-			GinkgoWriter.Println("EPP ConfigMap patched successfully — flowControl enabled, weights 2/2/3")
-			eppPatched = true
-		}
+		Expect(patchErr).NotTo(HaveOccurred(), "Prefill-heavy benchmark requires the EPP flowControl config")
+		GinkgoWriter.Println("EPP ConfigMap patched successfully — flowControl enabled, weights 2/2/3")
+		eppPatched = true
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/benchmark/prefill_heavy_benchmark_test.go` around lines 390 - 399, The
test currently treats PatchEPPConfigMap failures as non-fatal warnings which
allows runs to continue without the required EPP policy; change this to fail
fast by making the patch error fatal: replace the warning branch so that when
PatchEPPConfigMap(ctx, k8sClient, benchCfg.LLMDNamespace, eppDeployName) returns
an error you call Gomega's Expect/Fail (e.g.,
Expect(patchErr).NotTo(HaveOccurred()) or Fail with a clear message) so the test
stops immediately; keep the success branch setting eppPatched = true and the
success log.

Comment on lines +114 to +150
+		By("Checking external metrics API (best-effort, non-blocking)")
+		externalMetricsOK := false
+		Eventually(func() bool {
 			result, err := k8sClient.RESTClient().
 				Get().
 				AbsPath("/apis/external.metrics.k8s.io/v1beta1/namespaces/" + benchCfg.LLMDNamespace + "/wva_desired_replicas").
 				DoRaw(ctx)
-			g.Expect(err).NotTo(HaveOccurred(), "External metrics API should be accessible")
-			g.Expect(string(result)).To(ContainSubstring("wva_desired_replicas"), "Metric should be available")
-			g.Expect(string(result)).To(ContainSubstring(res.VAName), "Metric should reference the benchmark VA")
-			GinkgoWriter.Printf("External metrics API confirmed: wva_desired_replicas available for %s\n", res.VAName)
-		}, 5*time.Minute, 10*time.Second).Should(Succeed())
+			if err != nil {
+				GinkgoWriter.Printf(" External metrics API check: %v\n", err)
+				return false
+			}
+			s := string(result)
+			if strings.Contains(s, "wva_desired_replicas") && strings.Contains(s, res.VAName) {
+				GinkgoWriter.Printf("External metrics API confirmed: wva_desired_replicas available for %s\n", res.VAName)
+				externalMetricsOK = true
+				return true
+			}
+			return false
+		}, 3*time.Minute, 10*time.Second).Should(Or(BeTrue(), Not(BeTrue())))
+		if !externalMetricsOK {
+			GinkgoWriter.Println("WARNING: External metrics API not available — HPA may not scale. Proceeding anyway.")
+		}
 
-		By("Waiting for Prometheus to scrape simulator metrics")
-		Eventually(func(g Gomega) {
+		By("Checking Prometheus metrics (best-effort, non-blocking)")
+		promOK := false
+		Eventually(func() bool {
 			_, err := promClient.QueryWithRetry(ctx, `vllm:kv_cache_usage_perc`)
-			g.Expect(err).NotTo(HaveOccurred(), "Prometheus should have KV cache metrics from simulator")
-			GinkgoWriter.Println("Prometheus confirmed: vllm:kv_cache_usage_perc is available")
-		}, 5*time.Minute, 15*time.Second).Should(Succeed())
+			if err == nil {
+				GinkgoWriter.Println("Prometheus confirmed: vllm:kv_cache_usage_perc is available")
+				promOK = true
+				return true
+			}
+			return false
+		}, 2*time.Minute, 15*time.Second).Should(Or(BeTrue(), Not(BeTrue())))
+		if !promOK {
+			GinkgoWriter.Println("WARNING: Prometheus vLLM metrics not yet available — KV cache data may be incomplete.")
+		}

⚠️ Potential issue | 🟠 Major

These Eventually checks never actually wait.

Or(BeTrue(), Not(BeTrue())) matches both boolean outcomes, so each Eventually passes on the first poll. That removes the intended warm-up window and can start the benchmark before either metrics pipeline is ready. Use a real polling helper that returns a boolean after the timeout instead of a tautological matcher.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/benchmark/setup_test.go` around lines 114 - 150, The Eventually calls
are using a tautological matcher Or(BeTrue(), Not(BeTrue())) so they succeed
immediately; change those to a real wait by replacing Or(BeTrue(),
Not(BeTrue())) with BeTrue() so Eventually actually polls until the metric check
returns true or the timeout expires. Specifically update the two Eventually
invocations that run the anonymous funcs which set externalMetricsOK and promOK
(the k8sClient.RESTClient() check for "wva_desired_replicas" and
promClient.QueryWithRetry for `vllm:kv_cache_usage_perc`) to use
.Should(BeTrue()) and keep the existing post-Eventually warnings that examine
externalMetricsOK/promOK.

Comment thread test/utils/e2eutils.go
expectedVersion := os.Getenv("K8S_EXPECTED_VERSION")
if !clusterExists {
-		scriptCmd := exec.Command("bash", "deploy/kind-emulator/setup.sh", "-g", fmt.Sprintf("%d", maxGPUs), "-t", gpuType, "K8S_VERSION="+expectedVersion)
+		scriptCmd := exec.Command("bash", "deploy/kind-emulator/setup.sh", "-g", strconv.Itoa(maxGPUs), "-t", gpuType, "K8S_VERSION="+expectedVersion)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -C2 'K8S_VERSION|CheckIfClusterExistsOrCreate|exec\.Command\("bash", "deploy/kind-emulator/setup\.sh"' test/utils/e2eutils.go deploy/kind-emulator/setup.sh

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 2310


🏁 Script executed:

head -30 test/utils/e2eutils.go | grep -E "^import|^package"

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 108


🏁 Script executed:

sed -n '1,40p' test/utils/e2eutils.go

Repository: opendatahub-io/workload-variant-autoscaler

Length of output: 1017


Pass K8S_VERSION as an environment variable, not a positional argument.

Line 316 passes "K8S_VERSION="+expectedVersion as a positional argument to bash. The exec.Command function treats all trailing strings as argv entries, not environment variables. The setup.sh script expects K8S_VERSION via environment variable (line 23: k8s_version="${K8S_VERSION:-$DEFAULT_K8S_VERSION}"). As written, the cluster will always use DEFAULT_K8S_VERSION regardless of expectedVersion.

Fix
-		scriptCmd := exec.Command("bash", "deploy/kind-emulator/setup.sh", "-g", strconv.Itoa(maxGPUs), "-t", gpuType, "K8S_VERSION="+expectedVersion)
+		scriptCmd := exec.Command("bash", "deploy/kind-emulator/setup.sh", "-g", strconv.Itoa(maxGPUs), "-t", gpuType)
+		scriptCmd.Env = append(os.Environ(), "K8S_VERSION="+expectedVersion)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/utils/e2eutils.go` at line 316, The exec.Command call that creates
scriptCmd is passing "K8S_VERSION="+expectedVersion as a positional argv instead
of an environment variable; update the code around the scriptCmd creation so the
command arguments are only "bash", "deploy/kind-emulator/setup.sh", "-g",
strconv.Itoa(maxGPUs), "-t", gpuType (remove the "K8S_VERSION=..." arg) and set
the environment on scriptCmd (e.g., scriptCmd.Env = append(os.Environ(),
"K8S_VERSION="+expectedVersion)) so the setup.sh reads K8S_VERSION via the
environment as intended.
